I. ANALYSIS OF VARIANCE AND COVARIANCE


A. Introduction 


1. Definition. Analysis of variance is a statistical technique which tests mean
values by partitioning the total variance of a sample into component parts, each
of which can be assigned to a particular cause. Thus, in a study of point rainfall,
several rain gages may be used and several storms measured by each. Within the
total variance of all measured values of precipitation, there is a portion of the
variation which is due to the variation of the mean values recorded for each gage and
there is a portion which is the result of the variation of individual values about these
mean values. A comparison of the two variances can aid in determining whether any
measured difference in average rainfall is due to chance. The partitioning of the
variance in an analysis of variance is determined by the test to be made.

The partitioning of the variance can include the covariance of the variable being 
studied with another independent variable. Thus the difference in mean values can 
be studied after they have been corrected for the effect of the correlated independent 
variable. 

Before discussing the analysis of variance, it is well to consider two probability
distributions which play an important role in the analysis. These are the chi-square
distribution and the F distribution.

2. Chi-square Distribution. The distribution of the variance is fundamental
to many tests of statistical inference. First, consider the distribution of the sum of
squares of a variable. Let $x_1, x_2, \ldots, x_\nu$ be normally and independently
distributed variables, each with mean 0 and variance 1. Then

\[
\chi^2 = x_1^2 + x_2^2 + \cdots + x_\nu^2 \tag{8-III-1}
\]

is called chi-square and has the probability density function

\[
p(\chi^2) = \frac{(\chi^2)^{(\nu-2)/2} \exp(-\chi^2/2)}{2^{\nu/2}\,\Gamma(\nu/2)} \tag{8-III-2}
\]

where $\nu$ is called the number of degrees of freedom, and $\Gamma$ represents the gamma
function. Figure 8-III-1 shows this distribution, which is tabulated in most standard
statistics books for $\nu < 30$ (e.g., Hoel [1]). For larger values of $\nu$, the quantity
$(2\chi^2)^{1/2} - (2\nu - 1)^{1/2}$ is approximately normally distributed with mean 0 and
variance 1. Since, from Eq. (8-III-1), it can be shown that $ns^2/\sigma^2$ is distributed
like $\chi^2$ with $\nu = n - 1$ degrees of freedom, where $s^2$ and $\sigma^2$ are the sample and
population variances, respectively, then the sample variance of the $x_i$'s, scaled by
$n/\sigma^2$, is distributed as $\chi^2$ with $n - 1$ degrees of freedom [2].

Equation (8-III-2) is the basis for a test of whether or not a sample variance is
significantly different from a presumed population variance. If, for instance, a
regionalized flood-frequency curve gives a variance for annual floods of $\sigma_0^2$, and if n
floods for a given station which was not used to determine $\sigma_0^2$, and is therefore
independent of it, have a variance of $s^2$, then the ratio of the sample variance to the
"true," or regional, variance can be tested as $s^2/\sigma_0^2 = \chi^2/n$. The critical value of
$\chi^2$ for rejecting the null hypothesis $H_0\colon s^2 = \sigma_0^2$ is taken from tables of the
chi-square distribution at a chosen confidence level with $\nu = n - 1$ degrees of freedom.
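
The test just described can be sketched in a few lines of Python. This is a minimal illustration, not part of the original text: the function names are hypothetical, the sample variance is taken with divisor n so that $ns^2/\sigma_0^2$ is the chi-square variate, and the critical value is still assumed to be read from a table (or from Fig. 8-III-1).

```python
import math


def variance_chi_square(sample, sigma0_sq):
    """Return (chi2, dof) for testing a sample variance against sigma0_sq.

    chi2 = n * s^2 / sigma0^2, where s^2 uses the divisor n, and
    dof = n - 1 as stated in the text.
    """
    n = len(sample)
    mean = sum(sample) / n
    s_sq = sum((x - mean) ** 2 for x in sample) / n
    return n * s_sq / sigma0_sq, n - 1


def normal_deviate_large_dof(chi2, dof):
    """Large-dof approximation: sqrt(2*chi2) - sqrt(2*dof - 1) is roughly N(0, 1)."""
    return math.sqrt(2.0 * chi2) - math.sqrt(2.0 * dof - 1.0)


# Hypothetical annual flood values and an assumed regional variance of 4.0
chi2, dof = variance_chi_square([2, 4, 4, 4, 5, 5, 7, 9], 4.0)
```

The computed `chi2` is compared with the tabulated value for `dof` degrees of freedom; for large `dof` the second function gives a deviate that can be compared with the standard normal distribution instead.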

Fig. 8-III-1. Critical values of $\chi^2$.

Often it is desirable to test the hypothesis that $\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_k^2$. Consider
that a record of annual precipitation is available for n years and that during this period
of time the location of the rain gage has been moved k times. Thus the homogeneity
of the rainfall record is questioned. In order to test the hypothesis of the homogeneity
of the variance, the rainfall record is divided into k + 1 parts, each being that
part of the record during which time the rain gage was at a particular location. Let $n_i$
denote the number of years in the ith segment of the record and $s_i^2$ the variance of
the rainfall data in the ith segment, where i = 1, 2, . . . , k + 1. Bartlett [3] has
shown that the hypothesis of homogeneous variance can be tested by means of the
chi-square distribution, where $\chi^2$ for k degrees of freedom is defined as

If the value of $\chi^2$ computed by Eq. (8-III-3) exceeds the tabulated value of $\chi^2$ for k
degrees of freedom at a chosen confidence level, then the hypothesis of homogeneity of
the variance is rejected. It should be noted that the value of c is greater than unity
and tends to unity as the number of degrees of freedom increases. Hence, the value
of c may be set equal to unity in Eq. (8-III-3) unless there is some doubt about the
significance of $\chi^2$. An insignificant value of $\chi^2$ cannot be made significant by using
the actual value of c computed by Eq. (8-III-4).
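
As an illustration of the procedure, the sketch below computes Bartlett's statistic in its common textbook form. This is an assumption: the exact notation of Eqs. (8-III-3) and (8-III-4) may differ in detail (for example in the divisor used for $s_i^2$). Here each segment variance is the unbiased estimate with $\nu_i = n_i - 1$ degrees of freedom, and the function name is hypothetical.

```python
import math


def bartlett_statistic(segments):
    """Return (chi2, c, dof) for Bartlett's test of homogeneous variance.

    segments: list of lists, one per gage location (k + 1 segments in all).
    Assumed form: chi2 = M / c with
      M = (sum nu_i) ln(pooled s^2) - sum nu_i ln(s_i^2)
      c = 1 + [sum(1/nu_i) - 1/sum(nu_i)] / [3 (m - 1)],  m = number of segments.
    """
    m = len(segments)
    dofs, variances = [], []
    for seg in segments:
        n_i = len(seg)
        mean = sum(seg) / n_i
        dofs.append(n_i - 1)
        variances.append(sum((x - mean) ** 2 for x in seg) / (n_i - 1))
    total = sum(dofs)
    pooled = sum(v * s for v, s in zip(dofs, variances)) / total
    m_stat = total * math.log(pooled) - sum(
        v * math.log(s) for v, s in zip(dofs, variances)
    )
    c = 1.0 + (sum(1.0 / v for v in dofs) - 1.0 / total) / (3.0 * (m - 1))
    return m_stat / c, c, m - 1
```

Because c exceeds unity, dividing by it can only reduce the statistic, which is why a value that is insignificant with c = 1 stays insignificant when the actual c is used.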

3. F Distribution. Assume that $x_i$ ($i = 1, \ldots, n_1$) and $y_j$ ($j = 1, \ldots, n_2$)
are two independent random samples and that $s_1^2$ and $s_2^2$ are unbiased estimates of
the variance for each sample. The question arises whether or not the two samples
have been drawn from the same normal population having variance $\sigma^2$. Thus it is
necessary to test the hypothesis $\sigma_1^2 = \sigma_2^2$.

Fig. 8-III-2. Critical values of F at the 5 per cent level.
Fig. 8-III-3. Critical values of F at the 1 per cent level.

The basis for testing the hypothesis $\sigma_1^2 = \sigma_2^2$ is the ratio of the two sample variances,
which is defined as

\[
F = \frac{s_1^2}{s_2^2} \tag{8-III-6}
\]

whereby

\[
F = \frac{\nu_1 s_1^2 / (\nu_1 \sigma^2)}{\nu_2 s_2^2 / (\nu_2 \sigma^2)} \tag{8-III-7}
\]

From the previous discussion of the chi-square distribution, it is seen that the numerator
and denominator of the right-hand side of Eq. (8-III-7) are distributed independently
as $\chi^2/\nu$, with $\nu_1$ and $\nu_2$ degrees of freedom, respectively. It can be shown that
the probability density function of F is

\[
p(F) = \frac{\nu_1^{\nu_1/2}\,\nu_2^{\nu_2/2}\,F^{(\nu_1-2)/2}}
            {B(\nu_1/2,\ \nu_2/2)\,(\nu_2 + \nu_1 F)^{(\nu_1+\nu_2)/2}} \tag{8-III-8}
\]

where B denotes the beta function. Figures 8-III-2 and 8-III-3 show values of F
corresponding to upper-tail probabilities of 0.05 and 0.01, respectively, for various
values of $\nu_1$ and $\nu_2$. These values, which are available in tabular form (e.g., Hoel [1]),
facilitate the use of the F distribution for making statistical inferences when
analysis-of-variance techniques are used.
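
A minimal sketch of the variance-ratio computation follows. The function name is hypothetical, and the degrees of freedom are taken as $n_1 - 1$ and $n_2 - 1$, which is an assumption consistent with the use of unbiased variance estimates rather than something stated explicitly above.

```python
def variance_ratio(x, y):
    """Return (F, nu1, nu2) with F = s1^2 / s2^2 from unbiased sample variances."""

    def unbiased_variance(values):
        n = len(values)
        mean = sum(values) / n
        return sum((v - mean) ** 2 for v in values) / (n - 1)

    return unbiased_variance(x) / unbiased_variance(y), len(x) - 1, len(y) - 1


# Hypothetical samples; F is compared with the tabulated value for (nu1, nu2)
F, nu1, nu2 = variance_ratio([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```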


B. Analysis-of-variance Models


1. One-way Classification. Quite often it is necessary to determine whether 
or not an abrupt change in the mean value of some hydrologic statistic, e.g., measured 
precipitation or streamflow, has occurred or whether an appreciable difference in 
rainfall or runoff exists among several experimental watersheds. This test can be 
made through an analysis of variance. 

Assume that there are k drainage areas in a given region and that it is desired to
determine if the regional runoff can be considered as homogeneous. Let $x_{ij}$ denote
the mean annual discharge per square mile for the ith drainage area ($i = 1, \ldots, k$)
during the jth year of record ($j = 1, \ldots, n_i$). Thus the data may be considered
as divided into k classes, with $n_i$ items in the ith class. The total number of values
within all the classes is $N = \sum_{i=1}^{k} n_i$, the mean for the values in the ith class is

\[
\bar{x}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} x_{ij}
\]

and the mean for the values in all the classes is

\[
\bar{x} = \frac{1}{N} \sum_{i=1}^{k} \sum_{j=1}^{n_i} x_{ij}
\]


The total sum of squares of the departures of $x_{ij}$ from $\bar{x}$ may be divided into two
parts. The first is due to the variability which cannot be explained by differences of
regional effects between classes. The second is due to the variability between classes
averaged over the individual items in each class. Thus

\[
\sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2
  = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2
  + \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2 \tag{8-III-9}
\]

By taking the expectation of both sides of Eq. (8-III-9), it is seen that

\[
(N - 1)\sigma^2 = (N - k)\sigma^2 + (k - 1)\sigma^2 \tag{8-III-10}
\]

Thus the three sums of squares given in Eq. (8-III-9) give unbiased estimates of the
variance when they are divided by the appropriate number of degrees of freedom given
in Eq. (8-III-10).

The total response $x_{ij}$ of the jth individual in the ith class is assumed to be made up
of an overall effect $\mu$, a part $\beta_i$ characteristic of the ith drainage area, and a part
$\epsilon_{ij}$ which can be regarded as error. These parts are assumed to be additive, so that

\[
x_{ij} = \mu' + \beta_i + \epsilon_{ij} \tag{8-III-11}
\]

where $\mu'$ is adjusted ($\mu' = \mu + \bar{\beta}$) so that $\sum_{i=1}^{k} \beta_i = 0$. It is also assumed that
each $\epsilon_{ij}$ is an independent random normal variate with expectation 0 and variance $\sigma^2$,
independent not only of the other $\epsilon$'s, but also of the $\beta$'s.

The hypothesis which is to be tested is that the $\beta$'s are all zero, in which case the
regional runoff may be considered as homogeneous. The testing of this hypothesis
may be summarized in an analysis-of-variance table, such as Table 8-III-1.

Table 8-III-1. Test of Hypothesis for One-way Classification

Source of variation    Degrees of freedom   Sum of squares                                                  Mean square   F
Between class means    $k - 1$              $A = \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2$                $A/(k-1)$     $\dfrac{A(N-k)}{B(k-1)}$
Within classes         $N - k$              $B = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2$    $B/(N-k)$
Total                  $N - 1$              $C = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2$      $C/(N-1)$


If the computed value of F exceeds the tabulated value of F with $k - 1$ and $N - k$
degrees of freedom (numerator and denominator, respectively) for a chosen confidence
level, the hypothesis that the region is homogeneous is rejected. It should be noted
that B is used instead of C for determining F. On the basis of the null hypothesis, C is
the most reliable, since it is based on the largest number of degrees of freedom;
however, B is generally used since it is valid even when the null hypothesis is not true.
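
The entries of Table 8-III-1 can be computed as in the following sketch. It is an illustration only: the function name and the sample discharges are hypothetical, and the critical value of F is still assumed to come from a table or chart.

```python
def one_way_anova(classes):
    """Return the sums of squares A, B, C of Table 8-III-1 and the F ratio.

    classes: list of lists; classes[i] holds the n_i values of the ith class.
    """
    k = len(classes)
    N = sum(len(c) for c in classes)
    grand_mean = sum(sum(c) for c in classes) / N
    class_means = [sum(c) / len(c) for c in classes]

    # A: between class means; B: within classes; C: total
    A = sum(len(c) * (m - grand_mean) ** 2 for c, m in zip(classes, class_means))
    B = sum((x - m) ** 2 for c, m in zip(classes, class_means) for x in c)
    C = sum((x - grand_mean) ** 2 for c in classes for x in c)

    F = (A / (k - 1)) / (B / (N - k))
    return {"A": A, "B": B, "C": C, "F": F, "dof": (k - 1, N - k)}


# Hypothetical runoff values for three drainage areas
result = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

The identity A + B = C of Eq. (8-III-9) gives a convenient arithmetic check on the table.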

A one-way-classification analysis of variance may be used for studying the coefficient 
of skewness of low-flow data based on various durations in order to determine if the 
variation of the skewness between different durations is significant [4]. 

2. Two-way Classification. In the analysis based on a one-way classification,
the values in each class are considered as replicates of one another, subject only to
random variation. However, there may be a possible significant variation between
the individual values in each of the k classes. In order to investigate this possible
source of variation, assume that the number of items in each class is a constant equal
to n. Each item in each class corresponds to a given year, where each year is referred
to as a group. It should be noted that in each of the k classes there is one value from
each group, and in each of the n groups there is one value from each class. This
arrangement of the values of the variable being studied holds for all
two-way-classification analyses. Thus the data may be considered as divided into k classes
and n groups. In addition to class means and total mean defined for the
one-way-classification analysis, the group means are given by

\[
\bar{x}_j = \frac{1}{k} \sum_{i=1}^{k} x_{ij}
\]


The total sum of squares of the departures of $x_{ij}$ from $\bar{x}$ may be divided into three
parts, the first of which is due to the variability between the classes, the second to the
variability between groups, and the third to error or residual variability. Thus

\[
\sum_{i=1}^{k} \sum_{j=1}^{n} (x_{ij} - \bar{x})^2
  = \sum_{i=1}^{k} n (\bar{x}_i - \bar{x})^2
  + \sum_{j=1}^{n} k (\bar{x}_j - \bar{x})^2
  + \sum_{i=1}^{k} \sum_{j=1}^{n} (x_{ij} - \bar{x}_i - \bar{x}_j + \bar{x})^2 \tag{8-III-12}
\]

By taking the expectation of both sides of Eq. (8-III-12), it is seen that

\[
(nk - 1)\sigma^2 = (k - 1)\sigma^2 + (n - 1)\sigma^2 + (nk - n - k + 1)\sigma^2 \tag{8-III-13}
\]

whereby the four sums of squares given in Eq. (8-III-12) provide unbiased estimates
of the variance when they are divided by their appropriate number of degrees of
freedom given in Eq. (8-III-13).

The mathematical model is now

\[
x_{ij} = \mu + \alpha_i + \beta_j + \epsilon_{ij} \tag{8-III-14}
\]

where $\mu$ is the overall effect, $\alpha_i$ is the characteristic of the ith drainage area, $\beta_j$ is
the characteristic of the jth year of record, and $\epsilon_{ij}$ is the error. The overall effect $\mu$ is
considered to be adjusted so that $\sum_{i=1}^{k} \alpha_i = 0$ and $\sum_{j=1}^{n} \beta_j = 0$. It is assumed that
the $\epsilon$'s are all normally and independently distributed about zero with the same
variance $\sigma^2$.

The hypothesis which is to be tested is that all the $\alpha_i$'s and $\beta_j$'s are zero, in which case
the region may be considered as homogeneous with respect to space and time. The
testing of this hypothesis is summarized in Table 8-III-2.


Table 8-III-2. Test of Hypothesis for Two-way Classification

Source of variation   Degrees of freedom   Sum of squares                                                                     Mean square          F
Between classes       $k - 1$              $A = \sum_{i=1}^{k} n (\bar{x}_i - \bar{x})^2$                                     $A/(k-1)$            $\dfrac{A(n-1)}{C}$
Between groups        $n - 1$              $B = \sum_{j=1}^{n} k (\bar{x}_j - \bar{x})^2$                                     $B/(n-1)$            $\dfrac{B(k-1)}{C}$
Error (residual)      $(k-1)(n-1)$         $C = \sum_{i=1}^{k} \sum_{j=1}^{n} (x_{ij} - \bar{x}_i - \bar{x}_j + \bar{x})^2$   $C/[(k-1)(n-1)]$
Total                 $nk - 1$             $D = \sum_{i=1}^{k} \sum_{j=1}^{n} (x_{ij} - \bar{x})^2$                           $D/(nk-1)$


If the first value of F is found to exceed the tabulated value of F with $k - 1$ and
$(k - 1)(n - 1)$ degrees of freedom, then the hypothesis is rejected that all the $\alpha$'s are
zero, and if the second value of F exceeds the tabulated value of F with $n - 1$ and
$(k - 1)(n - 1)$ degrees of freedom, then the hypothesis is rejected that all the $\beta$'s are
zero. If both computed values of F are found to be significant, then the entire
hypothesis of homogeneity is rejected.
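
The two-way table can be sketched in the same way. The function below is illustrative only; its name and the example values are hypothetical, and the data are assumed to form a complete k-by-n array with one value per class and group.

```python
def two_way_anova(table):
    """Return the sums of squares of Table 8-III-2 and the two F ratios.

    table[i][j] is the value for class i (drainage area) and group j (year).
    """
    k, n = len(table), len(table[0])
    grand_mean = sum(sum(row) for row in table) / (k * n)
    class_means = [sum(row) / n for row in table]
    group_means = [sum(table[i][j] for i in range(k)) / k for j in range(n)]

    A = n * sum((m - grand_mean) ** 2 for m in class_means)
    B = k * sum((m - grand_mean) ** 2 for m in group_means)
    C = sum(
        (table[i][j] - class_means[i] - group_means[j] + grand_mean) ** 2
        for i in range(k)
        for j in range(n)
    )
    D = sum((x - grand_mean) ** 2 for row in table for x in row)

    error_mean_square = C / ((k - 1) * (n - 1))
    return {
        "A": A, "B": B, "C": C, "D": D,
        "F_classes": (A / (k - 1)) / error_mean_square,
        "F_groups": (B / (n - 1)) / error_mean_square,
    }


# Hypothetical values: 3 drainage areas (rows) by 3 years (columns)
result = two_way_anova([[1, 2, 4], [2, 3, 4], [5, 6, 7]])
```

A perfectly additive table gives C = 0 and undefined F ratios, so real data with some residual scatter are assumed.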

3. Linearity of Regression. Many hydrologic studies are based on regression 
analyses where generally linear functions are fitted to the data. However, in some 
studies, a linear function may be questionable and it may be necessary to test for 
nonlinearity of the regression line. This test may be made by an appropriate analysis 
of variance. 

The data should be partitioned into k arrays according to the values of the
independent variable. The range of values of the independent variable within each array
should be narrow enough so that the range in the values of the dependent variable in
each array approximates the spread of the values of the dependent variable about the
regression line within each array. Let $Y_i$ be the estimate of the dependent variable
from the regression line for the mean value of the independent variable in the ith
array, let $y_{ij}$ be the jth value ($j = 1, \ldots, n_i$) of the dependent variable in the ith
array, let $\bar{y}_i$ be the mean of the values of the dependent variable in the ith array, and
let $\bar{y}$ be the mean for all the values of the dependent variable.

The sum of squares of $(y_{ij} - \bar{y})$, which is proportional to the variance of the
dependent variable, may be divided into three parts. The first part is due to the
regression function itself; the second part is due to the deviations of the means from
the regression line; and the third part is due to the variation within the arrays. Thus

\[
\sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2
  = \sum_{i=1}^{k} n_i (Y_i - \bar{y})^2
  + \sum_{i=1}^{k} n_i (\bar{y}_i - Y_i)^2
  + \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 \tag{8-III-15}
\]

By taking the expectation of Eq. (8-III-15), it is seen that

\[
(N - 1)\sigma^2 = \sigma^2 + (k - 2)\sigma^2 + (N - k)\sigma^2 \tag{8-III-16}
\]

where $N = \sum_{i=1}^{k} n_i$. Thus the four sums of squares given in Eq. (8-III-15) provide
unbiased estimates of the variance when they are divided by their appropriate number
of degrees of freedom given in Eq. (8-III-16). The test for linearity is summarized
in Table 8-III-3.


Table 8-III-3. Test for Linearity

Source of variation                       Degrees of freedom   Sum of squares                                                  Mean square   F
Linear regression                         $1$                  $A = \sum_{i=1}^{k} n_i (Y_i - \bar{y})^2$                      $A/1$
Deviation of means from regression line   $k - 2$              $B = \sum_{i=1}^{k} n_i (\bar{y}_i - Y_i)^2$                    $B/(k-2)$     $\dfrac{B(N-k)}{C(k-2)}$
Within arrays                             $N - k$              $C = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2$    $C/(N-k)$
Total                                     $N - 1$              $D = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2$      $D/(N-1)$


On the assumption of linearity of regression, the sum of squares denoted by B is
due to sampling errors and the estimate of $\sigma^2$ obtained from B should not be greater
than that derived from the sum of squares within arrays, which is denoted by C.
If the computed value of F is found to be significant, then the hypothesis of linearity
of regression is rejected.
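
A sketch of the linearity test is given below. It assumes, beyond what the text states, that the regression line is the ordinary least-squares line fitted to all N points, and that the partition of Eq. (8-III-15) holds exactly only when the independent variable is constant within each array (approximately so when the arrays are narrow). The function name and data are hypothetical.

```python
def linearity_test(arrays):
    """Return A, B, C of Table 8-III-3 and F = [B/(k-2)] / [C/(N-k)].

    arrays: list of lists of (x, y) pairs, one inner list per array.
    """
    k = len(arrays)
    pairs = [p for arr in arrays for p in arr]
    N = len(pairs)
    x_bar = sum(x for x, _ in pairs) / N
    y_bar = sum(y for _, y in pairs) / N

    # Least-squares line y = a + b x fitted to all N points (an assumption)
    b = sum((x - x_bar) * (y - y_bar) for x, y in pairs) / sum(
        (x - x_bar) ** 2 for x, _ in pairs
    )
    a = y_bar - b * x_bar

    A = B = C = 0.0
    for arr in arrays:
        n_i = len(arr)
        x_mean = sum(x for x, _ in arr) / n_i
        y_mean = sum(y for _, y in arr) / n_i
        Y_i = a + b * x_mean  # regression estimate at the array mean of x
        A += n_i * (Y_i - y_bar) ** 2
        B += n_i * (y_mean - Y_i) ** 2
        C += sum((y - y_mean) ** 2 for _, y in arr)

    F = (B / (k - 2)) / (C / (N - k))
    return {"A": A, "B": B, "C": C, "F": F, "dof": (k - 2, N - k)}
```

At least three arrays are needed, since the deviations of means from the line carry k - 2 degrees of freedom.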


C. Analysis-of-covariance Models 


1. One-way Classification. At times it is desirable to test the significance of
the difference in mean values of a variable after these means have been corrected for
the effect of another correlated variable. Thus a study of the effect of deforestation
or reforestation upon streamflow must first eliminate the effect of varying
precipitation through the test period. If the data in this example were classified by years,
the covariance of streamflow with precipitation might be removed and the remainder
would be variance between years.

The covariance first is partitioned into

\[
\sum_i \sum_j (x_{ij} - \bar{x})(y_{ij} - \bar{y})
  = \sum_i \sum_j (x_{ij} - \bar{x}_i)(y_{ij} - \bar{y}_i)
  + \sum_i n_i (\bar{x}_i - \bar{x})(\bar{y}_i - \bar{y}) \tag{8-III-17}
\]

where the first term on the right represents covariance within classes (years), and the
second term that between classes (years). All three terms give an estimate of the
covariance, assuming there is no difference between years. The analysis of covariance,
as shown in Table 8-III-4, is intended to test the null hypothesis that there is
no difference in covariance between years and in the population as a whole. Thus our
null hypothesis would be that reforestation had no effect on streamflow. This type of
analysis can also be used for studying the significance of changes in the slopes of
double-mass curves [5].


Table 8-III-4. Analysis of Covariance

Source of covariance      Degrees of freedom   Sum of cross products                                              F
Between classes (years)   $k - 1$              $A = \sum_i n_i (\bar{x}_i - \bar{x})(\bar{y}_i - \bar{y})$         $\dfrac{(N-k)A}{(k-1)B}$
Within classes (years)    $N - k$              $B = \sum_i \sum_j (x_{ij} - \bar{x}_i)(y_{ij} - \bar{y}_i)$
Total                     $N - 1$              $C = \sum_i \sum_j (x_{ij} - \bar{x})(y_{ij} - \bar{y})$


A more exact method of testing the same hypothesis would be to determine the
significance of the coefficient of regression of streamflow with precipitation within
years. If this is significant, then the yearly means can be corrected for the regression
effect, and the differences between the corrected yearly means then are tested for
significance. In addition, the significance of the difference between the two regression
coefficients for between and within years can be tested to determine whether the
classification by years has an effect upon the degree of association of the variables.

The coefficients of regression are given in Table 8-III-5 with the t statistic

\[
t_{N-k} = B \left[ \frac{(N - k) \sum_i \sum_j (x_{ij} - \bar{x}_i)^2}
                        {\sum_i \sum_j (y_{ij} - \bar{y}_i)^2} \right]^{1/2} \tag{8-III-18}
\]

used to test the significance of the within-years coefficient.

Table 8-III-5. Test of Coefficients of Regression

Source of covariance      Degrees of freedom   Coefficient of regression
Between classes (years)   $k - 1$              $A = \dfrac{\sum_i n_i (\bar{x}_i - \bar{x})(\bar{y}_i - \bar{y})}{\sum_i n_i (\bar{x}_i - \bar{x})^2}$
Within classes (years)    $N - k$              $B = \dfrac{\sum_i \sum_j (x_{ij} - \bar{x}_i)(y_{ij} - \bar{y}_i)}{\sum_i \sum_j (x_{ij} - \bar{x}_i)^2}$
Total                     $N - 1$              $C = \dfrac{\sum_i \sum_j (x_{ij} - \bar{x})(y_{ij} - \bar{y})}{\sum_i \sum_j (x_{ij} - \bar{x})^2}$

2. Study of Regression Effect. In order to correct the yearly means for the
regression effect, the variance must be partitioned. First, it is partitioned into that
due to the regression of streamflow with precipitation and that due to deviations from
the regression line. The latter part, then, is further partitioned into that within
years and that between years.

The variation due to regression is

\[
A = \frac{\left[ \sum_i \sum_j (x_{ij} - \bar{x})(y_{ij} - \bar{y}) \right]^2}
         {\sum_i \sum_j (x_{ij} - \bar{x})^2} \tag{8-III-19}
\]

The total variation due to deviations from the regression line is

\[
B = \sum_i \sum_j (y_{ij} - \bar{y})^2
  - \frac{\left[ \sum_i \sum_j (x_{ij} - \bar{x})(y_{ij} - \bar{y}) \right]^2}
         {\sum_i \sum_j (x_{ij} - \bar{x})^2} \tag{8-III-20}
\]

The variation between years can be determined in two ways, however, depending upon
whether the between- or within-years regression coefficient is used as an adjusting
factor. That owing to within-years variation corrected for its regression coefficient is

\[
C = \sum_i \sum_j (y_{ij} - \bar{y}_i)^2
  - \frac{\left[ \sum_i \sum_j (x_{ij} - \bar{x}_i)(y_{ij} - \bar{y}_i) \right]^2}
         {\sum_i \sum_j (x_{ij} - \bar{x}_i)^2} \tag{8-III-21}
\]

and that due to between-years variation corrected for its regression coefficient is

\[
D = \sum_i n_i (\bar{y}_i - \bar{y})^2
  - \frac{\left[ \sum_i n_i (\bar{x}_i - \bar{x})(\bar{y}_i - \bar{y}) \right]^2}
         {\sum_i n_i (\bar{x}_i - \bar{x})^2} \tag{8-III-22}
\]


whereas the difference between Eqs. (8-III-20) and (8-III-21) gives the variation due
to between-years variation corrected for the within-classes regression coefficient.

The analysis of covariance is given in Table 8-III-6.

Table 8-III-6. Analysis of Covariance

Source of variation                                    Degrees of freedom   Sum of squares
Regression                                             $1$                  $A$
Deviations about regression line                       $N - 2$              $B$
Within classes                                         $N - k - 1$          $C$
Between classes based on between-classes regression    $k - 2$              $D$
Between classes based on within-classes regression     $k - 1$              $B - C$

The within-classes sum of squares divided by its degrees of freedom is the estimate
of the variance used for testing. An F test may be applied first to the regression effect

\[
F_{1,\,N-k-1} = \frac{(N - k - 1) A}{C} \tag{8-III-23}
\]

then to either or both measures of the between-years variation,

\[
F_{k-2,\,N-k-1} = \frac{(N - k - 1) D}{(k - 2) C} \tag{8-III-24}
\]

and

\[
F_{k-1,\,N-k-1} = \frac{(N - k - 1)(B - C)}{(k - 1) C} \tag{8-III-25}
\]

although the second test is more meaningful, if a class effect exists. To test whether
the regression coefficients based on within- and between-years variation are
significantly different, the difference of the two estimates of between-years variation may be
used. This variation, which is the result of the difference in regression coefficients,
may be used with one degree of freedom to compute an estimate of the variance, and
this tested against the within-groups variance. The resulting F test

\[
F_{1,\,N-k-1} = \frac{(N - k - 1)(B - C - D)}{C} \tag{8-III-26}
\]


may be used to test whether the two regression coefficients are significantly different. 
The null hypothesis tested in this example is that there is no time variation of the 
relation of streamflow to precipitation, the assumption being that any time variation 
which does exist is the effect of reforestation. 
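
The quantities A, B, C, and D of Eqs. (8-III-19) to (8-III-22) and the F ratios of Eqs. (8-III-23) to (8-III-26) can be sketched as follows. This is an illustration under stated assumptions: x is taken as precipitation and y as streamflow, each class is one year, at least three classes are supplied (so that k - 2 is not zero), and the function name is hypothetical.

```python
def covariance_analysis(classes):
    """Return sums of squares A..D and the four F ratios of the covariance analysis.

    classes: list of lists of (x, y) pairs, one inner list per class (year).
    """
    k = len(classes)
    pairs = [p for c in classes for p in c]
    N = len(pairs)
    gx = sum(x for x, _ in pairs) / N
    gy = sum(y for _, y in pairs) / N

    # Total sums of squares and cross products about the grand means
    sxx_t = sum((x - gx) ** 2 for x, _ in pairs)
    syy_t = sum((y - gy) ** 2 for _, y in pairs)
    sxy_t = sum((x - gx) * (y - gy) for x, y in pairs)

    sxx_w = syy_w = sxy_w = 0.0  # within classes
    sxx_b = syy_b = sxy_b = 0.0  # between classes
    for c in classes:
        n_i = len(c)
        mx = sum(x for x, _ in c) / n_i
        my = sum(y for _, y in c) / n_i
        sxx_w += sum((x - mx) ** 2 for x, _ in c)
        syy_w += sum((y - my) ** 2 for _, y in c)
        sxy_w += sum((x - mx) * (y - my) for x, y in c)
        sxx_b += n_i * (mx - gx) ** 2
        syy_b += n_i * (my - gy) ** 2
        sxy_b += n_i * (mx - gx) * (my - gy)

    A = sxy_t ** 2 / sxx_t           # Eq. (8-III-19)
    B = syy_t - A                    # Eq. (8-III-20)
    C = syy_w - sxy_w ** 2 / sxx_w   # Eq. (8-III-21)
    D = syy_b - sxy_b ** 2 / sxx_b   # Eq. (8-III-22)
    dof = N - k - 1
    return {
        "A": A, "B": B, "C": C, "D": D,
        "F_regression": dof * A / C,                   # Eq. (8-III-23)
        "F_between_own": dof * D / ((k - 2) * C),      # Eq. (8-III-24)
        "F_between_within": dof * (B - C) / ((k - 1) * C),  # Eq. (8-III-25)
        "F_coefficients": dof * (B - C - D) / C,       # Eq. (8-III-26)
    }
```

Each ratio is compared with the tabulated F for the degrees of freedom shown in its subscript in the corresponding equation.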


II. ANALYSIS OF TIME SERIES


A. Introduction 


1. Definition of Time Series. A time series is a sequence of values arrayed in
order of their occurrence which can be characterized by statistical properties. The
sequence of values is represented by $x(t_1), x(t_2), x(t_3), \ldots$, where $t_1 < t_2 < t_3 < \cdots$.
The daily hydrograph is a graphical representation of a time series of daily discharges.
Other examples of hydrologic time series are the annual sequences of floods, low flows, 
and mean discharges. A time series may be a function of time explicitly or a function 
of any single variable which takes the place of time. Examples of sequences ordered 
by distance rather than time are the width and roughness of a stream channel as a 
function of distance. 

Generally, it is possible to classify time series as being either of two types: stationary
or nonstationary. Assume that a time series is divided into several segments and that 
a statistical parameter such as the mean is used to characterize the data within each 
section. If the expected value of the statistical parameter is the same for each section, 
the time series is said to be stationary. If the expected values are not the same, the 
time series is nonstationary. In stationary time series, absolute time is not important, 
and the series may be assumed to have started somewhere in the infinite past. How- 
ever, in nonstationary time series, it is necessary to consider absolute time since the 
series cannot be assumed to have begun prior to the time of the initial observation. 

2. Characteristics of Time Series. Most of the statistical methods used in
hydrologic studies are based on the assumption that the observations are inde- 
pendently distributed in time. The occurrence of an event is assumed to be inde- 
pendent of all previous events. This assumption is not always valid for hydrologic 
time series. Observations of daily discharges do not change appreciably from one 
day to the next. There is a tendency for the values to cluster, in the sense that high 
values tend to follow high values and low values tend to follow low values. Thus 
the daily discharges are not independently distributed in time. The dependence 
between monthly discharges is less than that between daily discharges, and the 
dependence between annual discharges is less than that between monthly discharges. 
Thus the dependence between hydrologic observations decreases with an increase in 
the time base. 

Hydrologic time series may be considered as composed of the sum of two
components: a random element and a nonrandom element. A nonrandom element is said
to exist when observations separated by k time units are dependent. If the values of
$x_i$ are linearly dependent upon the values of $x_{i+k}$, then the correlation between $x_i$
and $x_{i+k}$ may be taken as the measure of dependence. This correlation is referred to
as the kth-order serial correlation.

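The serial correlation coefficient is analogous to the product-moment correlation
coefficient for two sets of data. If $x_i$ and $x_{i+k}$ are considered as two sets of data, then
the kth-order serial correlation coefficient is defined as

\[
r_k = \frac{\dfrac{1}{N-k} \sum_{i=1}^{N-k} x_i x_{i+k}
            - \dfrac{1}{(N-k)^2} \left( \sum_{i=1}^{N-k} x_i \right) \left( \sum_{i=1}^{N-k} x_{i+k} \right)}
           {\left[ \dfrac{1}{N-k} \sum_{i=1}^{N-k} x_i^2
                   - \dfrac{1}{(N-k)^2} \left( \sum_{i=1}^{N-k} x_i \right)^2 \right]^{1/2}
            \left[ \dfrac{1}{N-k} \sum_{i=1}^{N-k} x_{i+k}^2
                   - \dfrac{1}{(N-k)^2} \left( \sum_{i=1}^{N-k} x_{i+k} \right)^2 \right]^{1/2}} \tag{8-III-27}
\]

where N is the length of the time series. For $k = 0$, it follows that $r_0 = 1$, and for
$k \ge 1$, $-1 \le r_k \le 1$.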

If a time series is random, $r_k = 0$ for all values of $k \ge 1$. However, for a sample of
finite size, computed values of $r_k$ may differ from zero because of sampling errors.
Since N is small for most hydrologic sequences, the sampling errors are very large, so
that it is necessary to test the values of $r_k$ to determine if they are significantly different
from zero. A test of significance for $r_k$ is given below.
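
The definition of the kth-order serial correlation coefficient can be sketched in Python as follows. The function name and the short example series are hypothetical; the computation follows the product-moment form of Eq. (8-III-27), pairing the first N - k values with the last N - k values.

```python
import math


def serial_correlation(series, k):
    """Return the kth-order serial correlation coefficient r_k of a sequence."""
    m = len(series) - k          # number of pairs (x_i, x_{i+k})
    x = series[:m]               # x_i,     i = 1, ..., N - k
    y = series[k:]               # x_{i+k}, i = 1, ..., N - k
    mean_x = sum(x) / m
    mean_y = sum(y) / m
    cov = sum(a * b for a, b in zip(x, y)) / m - mean_x * mean_y
    var_x = sum(a * a for a in x) / m - mean_x ** 2
    var_y = sum(b * b for b in y) / m - mean_y ** 2
    return cov / math.sqrt(var_x * var_y)


# A steadily rising hypothetical series is perfectly correlated with itself at lag 1
r1 = serial_correlation([1.0, 2.0, 3.0, 4.0, 5.0], 1)
```

A constant series has zero variance and would cause a division by zero, so some variation in the data is assumed.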

An example of the computation of the first-order serial correlation coefficient $r_1$ for
the low flows (annual minimum daily discharges) for Middle Branch Westfield River
near Goss Heights, Mass., is shown in Table 8-III-7. The period of record is from
1913 to 1950 (N = 38). In the table, the columns headed $x_i$ and $x_{i+1}$ give the low-flow
values from 1913 to 1949 and from 1914 to 1950, respectively.

In order to determine $r_2$, it is necessary that $x_i$ and $x_{i+2}$ denote the low-flow values
from 1913 to 1948 and from 1915 to 1950, respectively. Similarly, by forming two
sets of data, the values of $r_3$, $r_4$, etc., can be determined.

3. Properties of the Nonrandom Element. The nonrandom element may be
composed of both a trend, or a long-term movement, and an oscillation about the
trend. Both of these parts need not be present in a particular time series. The
first step in analyzing a time series is to separate the nonrandom element from the 
random element. 

Trend is usually thought of as a smooth motion of the series over a long period of 
time. For any given time series, the sequence of values will follow an oscillatory 
pattern. If this pattern indicates a more or less steady rise or fall, it is defined as a 
trend. However, no matter what the length of a time series is, it can never be stated
with certainty that an apparent trend is not part of a slow oscillation, unless the series
ends.

Table 8-III-7. Computation of the First-order Serial
Correlation Coefficient

             x_i     x_{i+1}     x_i^2    x_{i+1}^2   x_i x_{i+1}
             1.6       0.4        2.56       0.16        0.64
             0.4       0.4        0.16       0.16        0.16
             0.4       2.9        0.16       8.41        1.16
             2.9       5.4        8.41      29.16       15.66
             5.4       5.0       29.16      25.00       27.00
             5.0       7.5       25.00      56.25       37.50
             7.5       5.0       56.25      25.00       37.50
             5.0      14         25.00     196          70.00
            14        15        196        225         210.00
            15         2.5      225          6.25       37.50
             2.5       3.0        6.25       9.00        7.50
             3.0       9.1        9.00      82.81       27.30
             9.1       4.0       82.81      16.00       36.40
             4.0       6.8       16.00      46.24       27.20
             6.8      14         46.24     196          95.20
            14         4.0      196         16.00       56.00
             4.0       4.7       16.00      22.09       18.80
             4.7       4.8       22.09      23.04       22.56
             4.8       2.1       23.04       4.41       10.08
             2.1       4.6        4.41      21.16        9.66
             4.6       6.0       21.16      36.00       27.60
             6.0       5.5       36.00      30.25       33.00
             5.5       2.5       30.25       6.25       13.75
             2.5       6.9        6.25      47.61       17.25
             6.9      10         47.61     100          69.00
            10         2.6      100          6.76       26.00
             2.6       4.6        6.76      21.16       11.96
             4.6       2.5       21.16       6.25       11.50
             2.5       4.4        6.25      19.36       11.00
             4.4       4.5       19.36      20.25       19.80
             4.5       4.8       20.25      23.04       21.60
             4.8      11         23.04     121          52.80
            11         3.5      121         12.25       38.50
             3.5       3.6       12.25      12.96       12.60
             3.6       2.6       12.96       6.76        9.36
             2.6       1.8        6.76       3.24        4.68
             1.8       3.6        3.24      12.96        6.48
Σ          193.6     195.6    1,483.84   1,494.24    1,234.70
Σ/(N - 1)    5.23      5.29      40.10      40.38       33.37

    r_1 = \frac{33.37 - (5.23)(5.29)}{\sqrt{[40.10 - (5.23)^2][40.38 - (5.29)^2]}} = 0.45


An oscillatory pattern is often confused with a cyclical pattern. For a cyclical 
time series, the maximum and minimum values occur at equal intervals of time with 
constant amplitude. The random element, if present, tends to distort this pattern. 
In an oscillatory time series, the amplitude and the interval of time between maximum
and minimum values are distributed about mean values. A cyclical time series is
oscillatory, but an oscillatory time series is not necessarily cyclical.


B. Trend Analysis 


1. Use of Moving Averages. Various methods of removing trend are available.
How these methods affect the time series is, however, not fully understood.
The most general method involves the fitting of a polynomial to the data.
This method has two principal objections: (1) the coefficients of the polynomial must
be defined by high-order moments which are unreliable because of their large sampling
errors since N is small, and (2) the coefficients of the polynomial must be recomputed
each time a new value is added to the time series because they are based on the
available data of the time series.

An alternative method of trend elimination is that of moving averages, which con- 
sists of finding a polynomial which will fit part of the record and using different 
polynomials for different parts of the record. This method permits the addition of 
new values without altering the previously fitted polynomials. 

In order to remove the trend, it is necessary to smooth out irregularities in the time
series. Assume that the observations x_1, x_2, . . . , x_N are taken at equal intervals of
time. The method of moving averages consists of determining overlapping means of
m successive weighted values. An example of moving averages of m = 3 is

    y_2 = \frac{b_1 x_1 + b_2 x_2 + b_3 x_3}{3}

    y_3 = \frac{b_1 x_2 + b_2 x_3 + b_3 x_4}{3}                        (8-III-28)

    \cdots

    y_{N-1} = \frac{b_1 x_{N-2} + b_2 x_{N-1} + b_3 x_N}{3}

The weights of the moving average, b_1, b_2, and b_3, are such that their sum equals 3.
In general, for moving averages of m,

    \sum_{i=1}^{m} b_i = m                                              (8-III-29)


The weights may be either all positive or a mixture of positive and negative. A simple moving
average refers to the case where each of the weights equals 1. Although a simple
moving average tends to smooth out the data, it does not preserve the main features
of the time series as well as a weighted moving average.

It is convenient to use odd values of m so that the computed values of y correspond
in time to the middle value of the x's being averaged. A moving average of m applied
to a sequence of N terms yields a sequence of N - 2n terms, where n = (m - 1)/2.
Thus, if m = 3, n = 1, so that one term is lost at the beginning and end of the time
series. Although it is possible to use moving averages of m = 2, 3, . . . , N - 1,
it is necessary that m be small relative to N.
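The weighted moving average of Eqs. (8-III-28) and (8-III-29) can be sketched as follows. The function name `moving_average` is hypothetical, and the check that the weights sum to m is an assumption about how strictly Eq. (8-III-29) should be enforced.

```python
def moving_average(x, weights):
    """Moving average of odd extent m = len(weights), following Eq. (8-III-28).

    Returns N - 2n values, n = (m - 1)/2; the first value corresponds in time
    to x[n], the middle of the first window.
    """
    m = len(weights)
    if m % 2 == 0:
        raise ValueError("use an odd extent m")
    if abs(sum(weights) - m) > 1e-9:    # Eq. (8-III-29): weights sum to m
        raise ValueError("weights must sum to m")
    return [
        sum(b * v for b, v in zip(weights, x[j:j + m])) / m
        for j in range(len(x) - m + 1)
    ]
```

A simple moving average is obtained with unit weights, for example `moving_average(flows, [1, 1, 1])` for m = 3, where `flows` stands for any list of observations.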

Generally, even a smooth trend obtained by the method of moving averages cannot
be represented conveniently by a mathematical equation. If a mathematical trend
is fitted to the data, a simple relation should be used unless logic indicates otherwise.
The simplest mathematical expression is a straight line. However, a time series is 
apt to be such that a single linear trend cannot be used throughout the time of observa- 
tion. In such cases, it is possible to use linear trends for portions of the time series. 

After a trend has been established, it is possible to remove the trend from the data
in one of several ways. One way is to take as a new variable the deviations about the
trend line. It is necessary that these deviations constitute a stationary time series.
This procedure of trend removal is widely used in hydrologic studies. With some
time series, such as tree-ring series, the deviations of the data about the trend line
decrease with time [6]. Hence, in this case, dividing the deviations by their
corresponding trend values yields a series which may be considered as stationary.
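The two ways of removing a linear trend described above (plain deviations, or deviations divided by the trend values) can be sketched as below. The least-squares fit against the index i = 1, . . . , N and the names `linear_trend` and `detrend` are assumptions made for illustration; the text does not prescribe how the trend line is fitted.

```python
def linear_trend(x):
    """Least-squares straight line through the points (i, x_i), i = 1..N.

    Returns (intercept, slope).
    """
    n = len(x)
    t_mean = (n + 1) / 2
    x_mean = sum(x) / n
    sxx = sum((i - t_mean) ** 2 for i in range(1, n + 1))
    sxy = sum((i - t_mean) * (v - x_mean) for i, v in enumerate(x, start=1))
    slope = sxy / sxx
    return x_mean - slope * t_mean, slope

def detrend(x, relative=False):
    """Deviations about the fitted trend line.

    With relative=True each deviation is divided by its trend value, as
    suggested for series whose scatter changes with the trend.
    """
    intercept, slope = linear_trend(x)
    trend = [intercept + slope * i for i in range(1, len(x) + 1)]
    if relative:
        return [(v - t) / t for v, t in zip(x, trend)]
    return [v - t for v, t in zip(x, trend)]
```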

As an example of trend analysis, simple moving averages of m = 3 and m = 5 are 
applied to the low-flow data for the Middle Branch Westfield River near Goss Heights, 
Mass. These results are shown graphically in Fig. 8-III-4. Both moving averages 
indicate an apparent trend. This apparent trend may, however, be part of an oscil- 
latory movement. With such a short series, it is difficult to prove that the apparent 
trend is significant, and not part of the oscillatory movement of the series. 

2. Slutzky-Yule Effect. Assume that a time series has an oscillatory component 
about a trend. If a moving average is used to determine the trend, a long-period oscil- 
lation tends to be included as part of the trend. Oscillations which are comparable in 
period with the length of the moving average m, or even shorter, are damped out. 
The moving average also introduces an oscillatory movement into the random element.
These consequences of the moving-average method are referred to as the Slutzky-Yule 
effect [7, 8]. 


[Figure: observed low flow and simple moving average plotted against year. Legend: observed low flow; simple moving average of 3.]

Fig. 8-III-4. Annual daily low flow for Middle Branch Westfield River near Goss Heights,
Mass.


If a simple moving average is used, the variance of the induced oscillation is 1/m
times the variance of the random element, and the average length of this induced
oscillatory movement is 360°/cos^{-1} [(m - 1)/(m + 1)]. If a weighted, instead of a
simple, moving average is used, the Slutzky-Yule effect is magnified. Because of the
Slutzky-Yule effect, care must be exercised in discussing the oscillatory character of a
time series if its trend has been removed by means of the moving-average method.


C. Tests for Serial Dependence 


1. Parametric Test of Significance. The variables x_i and x_{i+k} used to determine
the serial correlation coefficients are actually parts of the same time series. The serial 
correlation coefficients cannot be tested for significance by means of the test for the 
ordinary product-moment correlation between two random series unless N is very 
large. A reliable test of significance must be based on small-sample theory. 

Anderson [9] developed a test of significance based on a normal random time series
which is circular. A circular time series is defined as a time series where the last value
is followed by the first value so that the series repeats itself. As N tends to infinity,
the confidence limits based on a circular time series converge to those based on an open
time series. If N is small, only the low-order (k small) serial correlation coefficients
may be tested for significance. Blackman and Tukey [10] recommend that k/N
should not exceed 0.10. This rule appears to be satisfactory for deciding upon the
highest-order serial correlation which may be tested for significance by Anderson's
method.

With respect to the first-order serial correlation coefficient r_1, Anderson showed that,
for a normal random time series of N values, the expected value and the variance of
r_1 are -1/(N - 1) and (N - 2)/(N - 1)^2, respectively. Since r_1 is nearly normally
distributed, the confidence limits (CL) for a computed value of r_1 are given by

    CL(r_1) = \frac{-1 \pm t_\alpha \sqrt{N - 2}}{N - 1}                (8-III-30)

where t_α is the standardized normal variate corresponding to the probability level
1 - α.

If the computed value of r_1 lies within the confidence limits, then r_1 is considered to
be insignificantly different from zero at the probability level 1 - α. An insignificant
r_1 is a necessary, but not a sufficient, condition for deciding that a time series is random.
In order that a time series be regarded as random, it is necessary that r_k for
k > 1 be insignificant. Because of sampling errors, the serial correlation coefficients
for some values of k will be found to be significant even if the observed time series is
a sample from a random time series. However, the number of significant serial correlation
coefficients should not be greater than that expected by chance from the total
number of serial correlation coefficients tested. The reader is referred to Anderson's
paper [9] for tests of significance of r_k where k ≥ 2.

For the low-flow data for Middle Branch Westfield River near Goss Heights, Mass.,
r_1 is 0.45. At the 95 per cent level, t_α = 1.96. Thus the 95 per cent confidence
limits are 0.30 and -0.34. Since r_1 exceeds the upper confidence limit, it may be
regarded as significant at the 95 per cent level.
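A minimal sketch of the confidence limits of Eq. (8-III-30) follows, taking the standard deviation of r_1 as sqrt(N - 2)/(N - 1). The function names are hypothetical. For N = 38 and t_α = 1.96 the sketch gives limits close to those quoted above.

```python
from math import sqrt

def anderson_limits(n, t_alpha=1.96):
    """Lower and upper confidence limits for r_1 of a random series of n values."""
    centre = -1.0 / (n - 1)                       # expected value of r_1
    half_width = t_alpha * sqrt(n - 2) / (n - 1)  # t_alpha times the standard deviation
    return centre - half_width, centre + half_width

def r1_is_significant(r1, n, t_alpha=1.96):
    """True when r1 falls outside the confidence limits."""
    lower, upper = anderson_limits(n, t_alpha)
    return not (lower <= r1 <= upper)
```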

2. Nonparametric Tests of Significance. A nonrandom time series has an
oscillatory component. The observed values in a purely random time series fluctuate
erratically about some mean value. The fact that a time series exhibits more or less
erratic fluctuations suggests that the number of times that the values are above or 
below a given value is indicative of the randomness or nonrandomness of the time 
series. 

A nonparametric method of determining if a time series is random is that of the
median cross. For a time series of N values, the median is determined. From the
sequence of N values, the number of times that the series crosses the median is determined.
Let this number be denoted by n. The expected value and variance of n are
(N - 1)/2 and (N - 1)/4, respectively. Since n is nearly normally distributed, it is
possible to test if n is significantly different from (N - 1)/2 by


    t = \frac{n - (N - 1)/2}{\sqrt{(N - 1)/4}}                          (8-III-31)


If the absolute value of t is greater than the absolute value of t_α, the normal deviate at
the probability level 1 - α, the time series is regarded as nonrandom. If the contrary
is true, the time series is considered to be random.
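The median-cross statistic of Eq. (8-III-31) can be sketched as follows. The function name is hypothetical, and skipping values that equal the median is an assumption; the text does not say how such ties are handled.

```python
from math import sqrt
from statistics import median

def median_cross_test(x):
    """Return (number of median crossings n, statistic t) for the sequence x."""
    n_obs = len(x)
    med = median(x)
    # Assumption: values exactly equal to the median are skipped.
    above = [v > med for v in x if v != med]
    crossings = sum(1 for s, u in zip(above, above[1:]) if s != u)
    expected = (n_obs - 1) / 2
    variance = (n_obs - 1) / 4
    return crossings, (crossings - expected) / sqrt(variance)
```

The returned t is then compared with the normal deviate t_α in the manner described above.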

Another nonparametric method for determining if a time series is random is the
turning-point test. A turning point is associated with a value x_i, where either
x_{i-1} < x_i > x_{i+1} or x_{i-1} > x_i < x_{i+1}. Let m denote the number of turning points. The
expected value and the variance of m are 2(N - 2)/3 and (16N - 29)/90, respectively.
Since m is nearly normally distributed,

    t = \frac{m - 2(N - 2)/3}{\sqrt{(16N - 29)/90}}                     (8-III-32)


The significance or nonsignificance of m is determined in the same manner described
for n in the median-cross test. By using the median-cross and the turning-point tests
to determine if the Middle Branch of Westfield River data are random, the t values are
0.82 and 0, respectively. Since both t values are less than 1.64, the data are assumed
to be random. It is interesting to note that the nonparametric tests indicate randomness
and the parametric test indicates nonrandomness. The parametric test is
stronger and more reliable than the nonparametric tests.
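The turning-point statistic of Eq. (8-III-32) can be sketched in the same way. The function name is hypothetical; with strict inequalities, a run of equal neighbouring values is not counted as a turning point, which is an assumption.

```python
from math import sqrt

def turning_point_test(x):
    """Return (number of turning points m, statistic t) for the sequence x."""
    n_obs = len(x)
    m = sum(
        1
        for a, b, c in zip(x, x[1:], x[2:])
        if (a < b > c) or (a > b < c)   # peak or trough at the middle value
    )
    expected = 2 * (n_obs - 2) / 3
    variance = (16 * n_obs - 29) / 90
    return m, (m - expected) / sqrt(variance)
```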


D. Generating Processes 


1. Definition. A generating process is the manner by which the causal forces act 
to produce a time series. Some processes can be expressed mathematically, in which 
case it is possible to determine directly the various statistical characteristics of the 
time series. Often a time series is approximated by a certain process. The choice 
of the process is based upon how well the mathematical structure of the process 
conforms to the physical characteristics of the time series. The processes which 
have been used in hydrologic studies are (1) the moving average, (2) the sum of
harmonics, and (3) the autoregression. 

2. Moving-average Process. The moving-average process may be expressed as 


    x_i = b_0 y_i + b_1 y_{i-1} + b_2 y_{i-2} + \cdots + b_{m-1} y_{i-m+1}      (8-III-33)


where y is a random variable and m is the extent of the moving average. Equation
(8-III-33) may be taken as the model representing the relation between annual runoff
x and annual effective precipitation y, where m is the extent of the carryover due to
the water-retardation characteristics of the river basin. For such a model, the
weights b_0, b_1, . . . , b_{m-1} must all be positive and sum to unity. By virtue of the
moving average on the y's, the generated series x is not random. The serial correlation
coefficients for the x's are given by Wold [11]:

    r_k = \frac{\sum_{i=0}^{m-k-1} b_i b_{i+k}}{\sum_{i=0}^{m-1} b_i^2}    0 \le k \le m - 1      (8-III-34)

and

    r_k = 0    k \ge m                                                  (8-III-35)


It should be noted in Eqs. (8-III-34) and (8-III-35) that dependence between values
does not extend throughout the time series. Values separated by m or more time units
are independent.
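The serial correlation coefficients of the moving-average process can be sketched as below. The sketch assumes the weights are indexed b_0, . . . , b_{m-1}, so that the coefficients vanish for k ≥ m as Eq. (8-III-35) requires; the function name is hypothetical.

```python
def ma_serial_correlation(b, k):
    """Theoretical r_k for a moving-average process with weights b[0..m-1]."""
    m = len(b)
    if k >= m:
        return 0.0                      # Eq. (8-III-35): no dependence beyond m - 1
    numerator = sum(b[i] * b[i + k] for i in range(m - k))
    denominator = sum(w * w for w in b)
    return numerator / denominator
```

With two equal weights of 0.5, for instance, only the first-order coefficient is nonzero and it equals 0.5.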

3. Sum-of-harmonics Process. A simple model of the generating process of
the sum of harmonics is

    x_i = A \sin \theta i + y_i                                         (8-III-36)


where A and θ are the amplitude and angular frequency of cyclicity, respectively (the
period being 2π/θ), and y is a random component. Equation (8-III-36) may be taken as a
model representing seasonal discharges. For example, if the x's denote monthly discharges
and if there is a distinct period of high flow and of low flow, then θ = π/6, and i would
represent the months from 1 to 12. The generated x's are nonrandomly distributed in time. The
serial correlation coefficients are defined by

    r_k = \frac{A^2 \cos \theta k}{2 \, \mathrm{Var}(x)}                (8-III-37)

where the variance of x, Var (x), is defined by

    \mathrm{Var}(x) = \frac{A^2}{2} + \mathrm{Var}(y)                   (8-III-38)


Equation (8-III-36) is a special case, where only one harmonic is involved. It is
often argued that there are hidden periodicities in hydrologic data of annual sequences.
The periodicities are called hidden since the superposing of many series of different
harmonics yields a series which is seemingly random. A recent study involving the
search for hidden periodicities in rainfall has been made by Abbott [12].
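Equations (8-III-37) and (8-III-38) can be evaluated with a short sketch such as the one below. The function name and argument names are hypothetical, and the expression is meant for k ≥ 1.

```python
from math import cos, pi

def harmonic_serial_correlation(amplitude, theta, var_y, k):
    """Theoretical r_k (k >= 1) for x_i = A sin(theta * i) + y_i."""
    var_x = amplitude ** 2 / 2 + var_y            # Eq. (8-III-38)
    return amplitude ** 2 * cos(theta * k) / (2 * var_x)   # Eq. (8-III-37)
```

With θ = π/6 the coefficients repeat every 12 lags, so that for monthly data the correlogram reaches its trough at k = 6 and its crest at k = 12.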

4. Autoregression Process. An autoregression process is used in hydrologic
studies for representing sequences whose nonrandomness is due to storage in the basin
(groundwater, lake, or channel storage). There are many autoregressive models;
however, the first-order process is defined as

    x_{i+1} = r_1 x_i + e_{i+1}                                         (8-III-39)


where r_1 is the first-order serial correlation coefficient for the x's and e is a random
component. This process is often referred to as the first-order Markov process. For this
process, the serial correlation coefficients are given by

    r_k = r_1^k                                                         (8-III-40)

If r_1 is positive, then all values of r_k are positive and r_1 > r_2 > · · · . If r_1 is negative,
then r_k is positive for even values of k and negative for odd values of k. The
absolute value of r_k decreases as k increases.
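The first-order Markov process and its theoretical serial correlations can be sketched as follows. The function names are hypothetical, and drawing the random component from a normal distribution scaled to give x unit variance is an assumption made only for the illustration.

```python
import random

def markov_correlogram(r1, k_max):
    """Theoretical r_k = r1**k for k = 0..k_max, Eq. (8-III-40)."""
    return [r1 ** k for k in range(k_max + 1)]

def simulate_markov(r1, n, seed=0):
    """Generate n values of x_{i+1} = r1 * x_i + e_{i+1}, Eq. (8-III-39).

    Assumption: e is normal with variance 1 - r1**2, so x has unit variance.
    """
    rng = random.Random(seed)
    scale = (1 - r1 ** 2) ** 0.5
    x = [rng.gauss(0, 1)]
    for _ in range(n - 1):
        x.append(r1 * x[-1] + scale * rng.gauss(0, 1))
    return x
```

A negative r_1 gives theoretical coefficients that alternate in sign while shrinking in absolute value, as stated above.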

5. Correlograms. A correlogram is a graphical representation of the r_k as a
function of k where the values of r_k are plotted as ordinates against their respective
values of k as abscissas. In order to reveal the features of the correlogram better,
the plotted points are joined each to the next by a straight line.

From Eqs. (8-III-34) and (8-III-35), it is seen that the correlogram for a moving
average may oscillate, depending upon the b's, but it will vanish for all values of
k ≥ m. It is seen from Eq. (8-III-37) that the correlogram for a harmonic process
will oscillate with angular frequency θ and amplitude A^2/[2 Var (x)]. The period of oscillation of
the correlogram is the same as that for the time series itself. For the autoregression
process, it is seen by Eq. (8-III-40) that, if r_1 is positive, the correlogram will decrease
monotonically from r_0 = 1 toward zero. If r_1 is negative, the correlogram will oscillate
about the abscissa, changing sign at each unit lag, with a decreasing but nonvanishing amplitude.

The correlogram provides a theoretical basis for distinguishing among the three 
types of oscillatory time series. From a set of data, the serial correlation coefficients 
can be determined and the correlogram can be constructed. The shape of the correlo- 
gram is indicative of the generating process in the manner described above. In 
practice, the number of observations forming a sequence is small, so that observed cor- 
relograms always show less damping than theoretical correlograms because the 
observed serial correlation coefficients are inflated by sampling errors. Thus one can- 
not easily discern what the generating process is simply by observing the correlogram. 

At present there is no adequate small-sample test for distinguishing among the 
generating processes. Quenouille [13] has developed tests of significance of the cor- 
relograms for various autoregressive models. However, these tests are based on the 
length of sequence being very large. 


E. Effect of Serial Correlation 


1. Estimation of the Variance. Serial correlation represents a tendency for
fluctuations about the mean to perpetuate themselves. In nonrandom hydrologic
time series, r_1 usually is positive, so that high values tend to follow high values and low
values tend to follow low values. Thus values near x_i yield little new information
concerning the true fluctuation of the events about the mean. The amount of
information which is furnished varies inversely with r_1. If r_1 = 0, then each successive
event furnishes new information. If r_1 = 1, then each event contains all
the available information, so that each successive event furnishes no new information.
Thus, for a given nonrandom time series of length N, the information given by the N
values is equal to that given by a random time series of length N', where N' < N.
N' is often referred to as the effective length of record. With respect to a given value of
N, the larger r_1 is, the smaller N' is.

For a sequence of N' events, taken from a random time series, the unbiased estimator
of the variance is given by

    \hat{\sigma}^2 = \frac{\sum_{i=1}^{N'} x_i^2 - \frac{1}{N'}\left(\sum_{i=1}^{N'} x_i\right)^2}{N' - 1} = \frac{N'}{N' - 1} S^2      (8-III-41)

so that the expected value of σ̂^2 is σ^2. For a sequence of N events, from a time series
generated by a first-order Markov process, the unbiased estimator of the variance is
given by

    \hat{\sigma}^2 = \frac{\sum_{i=1}^{N} x_i^2 - \frac{1}{N}\left(\sum_{i=1}^{N} x_i\right)^2}{N - \dfrac{1 + r_1}{1 - r_1} + \dfrac{2 r_1 (1 - r_1^N)}{N (1 - r_1)^2}}      (8-III-42)

If r_1 = 0 and N = N', Eq. (8-III-42) reduces to Eq. (8-III-41). By equating Eqs.
(8-III-41) and (8-III-42), it is possible to determine N' for given values of r_1 and N.
A graphical procedure facilitates the determination of N'. In Fig. 8-III-5 a family
of curves is shown for S^2/σ̂^2 versus N as a function of r_1. As N tends to infinity,
S^2/σ̂^2 tends to unity for all values of r_1. The larger r_1 is, the slower is the rate of
convergence. For a given sequence, N is known and r_1 can be determined by Eq.
(8-III-27). Thus, starting with the value of N on the abscissa, a vertical line is
drawn upward until it intersects the curve corresponding to r_1. From this point of
intersection, a horizontal line is drawn to the left until it intersects the curve for r_1 = 0.
From this point of intersection, a vertical line is drawn downward to the abscissa
scale to determine N'. An example is shown in Fig. 8-III-5. It is assumed that
N = 30 and r_1 = 0.4, so that N' = 12.

Fig. 8-III-5. Relation between S^2/σ̂^2 and N as a function of r_1.

2. Correlation and Regression Analyses. In hydrologic studies, one is often
interested in whether or not two or more variables are associated (see also Sec. 8-II).
Extensive theory has been developed for determining the degree of association by 
means of the correlation coefficient when each variable is randomly distributed. The 
correlation between two nonrandom time series can be determined, but cannot be 
tested for significance in the same manner as the correlation between two random 
variables. If N pairs of observations are available, each observation cannot be con- 
sidered as contributing new information about the correlation if the two time series 
are nonrandom. 

The test of significance for the correlation between two random variables is based
on the t test [2], in which

    t = \frac{r \sqrt{n}}{\sqrt{1 - r^2}}                               (8-III-43)

Here r denotes the correlation between the two variables, and n, which is equal to
N - 2, where N is the number of pairs of observations, denotes the number of degrees
of freedom.

In order to test the correlation between two nonrandom time series for significance,
it is necessary to replace n by the effective number n' of degrees of freedom. From
Bartlett's work [14] it can be shown that, for very large sample sizes,

    n' = \frac{n}{1 + 2 r_1 r_1' + 2 r_2 r_2' + \cdots}                 (8-III-44)¹

where r_1, r_2, . . . are the serial correlation coefficients for one of the time series and
r_1', r_2', . . . are the serial correlation coefficients for the other time series. To
determine n', it is necessary to compute the serial correlation coefficients for many
orders. This is very laborious, and because of the sampling errors associated with
the serial correlation coefficients, it is not possible to determine n' accurately.

A useful formula for n' is obtained by considering each of the time series to be
generated by a first-order Markov process. For this process, r_k = r_1^k and r_k' = (r_1')^k.
By using these relations in Eq. (8-III-44), n' becomes

    n' = n \, \frac{1 - r_1 r_1'}{1 + r_1 r_1'}                         (8-III-45)


By Eqs. (8-III-44) and (8-III-45), it can be seen that if either time series is random,
n' = n. This is consistent with the fact that each observation of the random time
series contributes completely new information on the value of the correlation between
the random and nonrandom time series.
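The test statistic and the first-order Markov approximation to the effective number of degrees of freedom can be sketched as below. The sketch takes the statistic in the usual form t = r sqrt(n)/sqrt(1 - r^2) and the effective number as n(1 - r_1 r_1')/(1 + r_1 r_1'); the function names are hypothetical.

```python
from math import sqrt

def correlation_t(r, n):
    """t statistic for a correlation r with n degrees of freedom."""
    return r * sqrt(n) / sqrt(1 - r * r)

def effective_dof(n, r1_a, r1_b):
    """Effective degrees of freedom n' when both series are first-order Markov.

    r1_a and r1_b are the first-order serial correlations of the two series.
    """
    product = r1_a * r1_b
    return n * (1 - product) / (1 + product)
```

If either first-order coefficient is zero the function returns n unchanged, in line with the remark above that a random series contributes completely new information.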

Regression analysis is used in hydrologic studies to establish the relation between a
given variable (referred to as the dependent variable) and one or more variables 
(referred to as the independent variables). The classical theory of regression analysis 
is based on the assumption that each variable is randomly distributed. If both the 
dependent and independent variables are time series, it is necessary to determine 
if the variables are random or not. An ordinary regression analysis involving time 
series is valid under two conditions: (1) if either the dependent or the independent 
variables are random, and (2) if the deviations from the line of regression are serially 
independent. 

The serial dependence of the deviations from the line of regression may be deter- 
mined by means of serial correlation coefficients. However, in testing the serial 
correlations of the deviations for significance, it is necessary to allow for the fitting 
of the regression line. No exact test of significance is available. An approximate 
test of significance is given by Durbin and Watson [15]. This test, summarized in 
Table 8-III-8, is based on giving correction terms for determining the effective number
of deviations. 


Table 8-III-8. Corrections to Number of Observations

Number of independent          Level of significance
      variables             P = 0.05          P = 0.02
          1               (-1)   (20)       (-1)   (16)
          2               (-5)   (35)       (-5)   (30)
          3               (-10)  (60)       (-10)  (50)
          4               (-15)  (100)      (-15)  (75)


1 See also Eq. (8-11-27).

Table 8-III-8 is used in the following manner. Assume that a regression analysis
involving two time series is based on N = 20 pairs of observations. The serial correlation
of the deviations from the line of regression may be tested for significance by
Eq. (8-III-43), where N = 20 - 1 = 19 and N = 20 + 20 = 40. At the 95 per cent
confidence level t is approximately 2. Thus, for N = 19, r = 0.46, and for N = 40,
r = 0.31. If the computed serial correlation is greater than r = 0.46, then the serial
correlation is significantly greater than 0 at the 95 per cent level. A computed serial


Table 8-III-9. Data for Studying the Effect of Serial Correlation on
Correlation and Regression Analyses 


Year x 
1913 І 
1914 29 
1915 20 
1916 45 
1917 30 
1918 32 
1919 24 
1920 25 
1921 31 
1922 33 
1923 20 
1924 18 
1925 16 
1926 14 
1927 20 
1928 43 
1929 20 
1930 15 
1931 14 
1932 15 
1933 21 
1934 23 
1935 29 
1936 20 
1937 26 
1938 24 
1939 22 
1940 25 
1941 14 
1942 14 
1943 19 
1944 21 
1945 41 
1946 39 
1947 30 
1948 19 
1949 1 
1950 21
staan. | Cav» —3.03
13.69 —0.32 —24.£4 
12.34 0.22 —4.03 
38.04 2.72 —6.53 
29.34 4.09 —1.03 
21.56 2.55 —1.83 
12.86 5.23 2.21 
16.65 1.60 —0.43 
22.30 11.7 7.37 
22.21 8.66 7.97 
8.52 —4.30 —1.93 
11.04 1.87 —1.03 
9.74 FT 5.47 
8.43 —0.12 0.77 
15.13 4.99 2.37 
36.04 10.92 4.97 
5.04 — 2.34 — 0.43 
8.04 2.39 | 1.27 
8.78 2:67 1.57 
10.13 —60.07 | —1.33 
15.78 3.65 | —0.03 
15.69 3.92 | 0.97 
21.00 2.73 | —0.73 
9.91 0.01 | —1.93 
19.04 5.77 1.27 
14,95 6.87 4.77 
13.65 —1.93 | —2.93 

i I 
17.34 | 4.49 | -—0.83 

5.30 0.42 —0.73 
9,13 3.27 £17 
14.13 | 2.51 0.27 
14.39 | 2.76 0.17 

| | 
33.69 8.83 | 
24.73 —1.48 
16.43 | 2.01 

8.56 0.97 
11.39 0.62 
EB 2.78 —1.03 


correlation is nonsignificant if it is less than r = 0.31. If a computed serial correlation
lies between these two values, then there is doubt about the significance at the
95 per cent level, since it is not certain if the 95 per cent level is reached [12].
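The two bounding values of r in the example above can be reproduced with a short sketch. It inverts t = r sqrt(n)/sqrt(1 - r^2) with n = N - 2 to obtain r = t/sqrt(n + t^2). The names below are hypothetical, and the t values passed in the usage note (2.110 for 17 degrees of freedom and 2.024 for 38) are assumed two-sided 5 per cent points of Student's t rather than values given in the text.

```python
from math import sqrt

# Corrections to N at P = 0.05 from Table 8-III-8, keyed by the number of
# independent variables: (correction for upper bound, correction for lower bound).
CORRECTIONS_P05 = {1: (-1, 20), 2: (-5, 35), 3: (-10, 60), 4: (-15, 100)}

def critical_r(n_eff, t):
    """Value of r that gives the statistic t with n_eff - 2 degrees of freedom."""
    dof = n_eff - 2
    return t / sqrt(dof + t * t)

def serial_corr_bounds(n_obs, n_indep, t_small, t_large):
    """Return (lower, upper) bounds on the serial correlation of the deviations.

    Above `upper` the correlation is significant; below `lower` it is not;
    in between the result is in doubt.
    """
    minus, plus = CORRECTIONS_P05[n_indep]
    upper = critical_r(n_obs + minus, t_small)
    lower = critical_r(n_obs + plus, t_large)
    return lower, upper
```

For example, `serial_corr_bounds(20, 1, 2.110, 2.024)` returns values that round to 0.31 and 0.46.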

If the residuals are serially uncorrelated, an ordinary regression analysis is valid.
However, if the residuals are serially correlated, it is necessary to take this fact into
account in the regression analysis. Quenouille [13] suggests that this may be done in
one of two ways. The first way consists in calculating the deviations from a serial
regression of the dependent variable upon previous values of itself and using these
deviations in a regression analysis on the independent variables and their previous
values. The second way is to make a regression analysis of the dependent variable
upon the independent variables and upon previous values of the independent variables
and itself. The first method may be used to predict the random variation in the
dependent variable from the independent variables. By the second method, values
of the dependent variable may be estimated from the independent variables and
previous observations.

In order to clarify the above discussions an example is given. Table 8-III-9 lists two
time series under the columns headed by x and y, respectively. The correlation between x and
y, r(xy), is 0.47, and the first-order serial correlations for the x and y series are
r_1(x) = 0.453 and r_1(y) = 0.348, respectively. By assuming that both series are
generated by a first-order Markov process, then the effective number of degrees of
freedom is, according to Eq. (8-III-45), 28. By using Eq. (8-III-43), t = 3.08.
Since the value of t at the 95 per cent level, 2.056, is less than 3.08, the correlation
between x and y is significant.

The equation for the regression of y on x is


    y = 0.31 + 0.20x                                                    (8-III-46)


The deviations from this regression are given in Table 8-III-9 under the column
headed by y'. The first-order serial correlation of the deviations, r_1(y'), is 0.333.
By using the corrections, given in Table 8-III-8, to the number of observations and
applying Eq. (8-III-43), it is seen that r_1(y') is significant at the 95 per cent level.
Since the x, y, and y' series are nonrandom, an ordinary regression analysis cannot be
made.

The deviations from the serial regression of the independent variable upon previous
values of itself are given by


    x_{i+1} - 0.453 x_i = (e_x)_{i+1}                                   (8-III-47)


These deviations are given in Table 8-III-9 under the column headed by e_x. Similarly,
the deviations from the serial regression of the dependent variable upon previous
values of itself can be determined. These deviations are given in Table 8-III-9
under the column headed by e_y. The first-order serial correlation coefficients for
these two sets of deviations are r_1(e_x) = 0.226 and r_1(e_y) = -0.176. Both of these
coefficients are insignificant at the 95 per cent level. Thus the e_x and e_y series may be
considered as random.
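The deviations from a serial regression of the form of Eq. (8-III-47) can be computed with a one-line sketch. The function name is hypothetical, and the short series in the usage note is made up of illustrative values only.

```python
def serial_regression_deviations(series, r1):
    """Deviations e_{i+1} = x_{i+1} - r1 * x_i for successive pairs of values."""
    return [nxt - r1 * cur for cur, nxt in zip(series, series[1:])]
```

For instance, `serial_regression_deviations([1.6, 0.4, 0.4, 2.9], 0.453)` yields three deviations, one fewer than the number of observations, because the first value has no predecessor.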

If the first method of accounting for the serial correlation is used, it is necessary to
determine the regression of e_y on e_x. This regression gives


    (e_y)_{i+1} = -0.782 + 0.232 (e_x)_{i+1}                            (8-III-48)

so that y_{i+1} might be predicted using

    y_{i+1} = -0.782 + 0.348 y_i + 0.232 x_{i+1} - 0.105 x_i            (8-III-49)


If the second method is used, a multiple regression of y_{i+1} on y_i, x_{i+1}, and x_i must be
carried out. This regression gives


    y_{i+1} = 1.210 + 0.602 y_i + 0.244 x_{i+1} - 0.205 x_i             (8-III-50)

In order to use either Eq. (8-III-49) or Eq. (8-III-50), it is necessary that the deviations
from the line of regression be serially independent.
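
The step from the regression of the deviations to the prediction equation for the first method can be written out as follows. This is a sketch that assumes the deviations of the dependent variable are defined by analogy with Eq. (8-III-47), using r_1(y) = 0.348 and r_1(x) = 0.453.

```latex
\begin{aligned}
(e_y)_{i+1} &= y_{i+1} - 0.348\,y_i, \qquad (e_x)_{i+1} = x_{i+1} - 0.453\,x_i \\
y_{i+1} - 0.348\,y_i &= -0.782 + 0.232\,\bigl(x_{i+1} - 0.453\,x_i\bigr) \\
y_{i+1} &= -0.782 + 0.348\,y_i + 0.232\,x_{i+1} - (0.232)(0.453)\,x_i \\
        &\approx -0.782 + 0.348\,y_i + 0.232\,x_{i+1} - 0.105\,x_i
\end{aligned}
```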